9 research outputs found

A Bayesian Approach to Graphical Record Linkage and Deduplication

Author: Fienberg SE
Hall R
Steorts RC
Publication date: 01/10/2016

© 2016 American Statistical Association. We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
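The linkage structure described in this abstract can be illustrated with a minimal sketch. It assumes that each posterior draw is a vector of labels, one per record, naming the latent individual that record points to; two records are then co-referent in a draw exactly when their labels agree. The function names and the toy draws below are hypothetical and are not taken from the paper or its supplementary materials.

```python
from itertools import combinations


def pairwise_link_probabilities(samples):
    """Estimate P(records i and j refer to the same individual).

    samples[s][i] is the latent-individual label of record i in draw s
    (an assumed representation of the bipartite linkage structure).
    """
    n_records = len(samples[0])
    probs = {}
    for i, j in combinations(range(n_records), 2):
        agree = sum(1 for labels in samples if labels[i] == labels[j])
        probs[(i, j)] = agree / len(samples)
    return probs


def k_way_probability(samples, records):
    """Estimate the probability that all listed records share one individual."""
    hits = sum(1 for labels in samples
               if len({labels[r] for r in records}) == 1)
    return hits / len(samples)


# Hypothetical posterior draws for four records (illustrative values only).
samples = [
    [0, 0, 1, 2],
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [0, 0, 2, 2],
]
probs = pairwise_link_probabilities(samples)
print(probs[(0, 1)])                         # share of draws linking records 0 and 1
print(k_way_probability(samples, [0, 1, 2]))  # share of draws linking all three
```

Because links are defined through shared labels, transitivity holds automatically within each draw: if records 0 and 1 share a label and records 1 and 2 share a label, then records 0 and 2 do as well.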

DukeSpace

SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

Author: Fienberg SE
Hall R
Steorts RC

We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We assess our results on real and simulated data.

DukeSpace

Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

Author: A Tancredi
B Liseo
G Kim
H Goldstein
H Yamato
J Copas
J Pitman
M Hof
M Sadinle
P De Blasi
P Christen
P Lahiri
R Gutman
RC Steorts
RM Neal
SN MacEachern
Publication date: 01/01/2018

Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. One goal of the downstream task is obtaining a larger reference data set, allowing one to perform more accurate statistical analyses. In addition, there is inherent record linkage uncertainty passed to the downstream task. Motivated by the above, we propose a generalized Bayesian record linkage method and consider multiple regression analysis as the downstream task. Records are linked via a random partition model, which allows for a wide class to be considered. In addition, we jointly model the record linkage and downstream task, which allows one to account for the record linkage uncertainty exactly. Moreover, one is able to generate a feedback propagation mechanism of the information from the proposed Bayesian record linkage model into the downstream task. This feedback effect is essential to eliminate potential biases that can jeopardize the resulting downstream task. We apply our methodology to multiple linear regression, and illustrate empirically that the "feedback effect" is able to improve the performance of record linkage. Comment: 18 pages, 5 figures.

arXiv.org e-Print Archive

Crossref

Archivio della ricerca- Università di Roma La Sapienza

Using metric space indexing for complete and efficient record linkage

Author: A Reid
B Ramadan
C Li
D Hand
G Papadakis
GR Hjaltason
H Newcombe
IP Fellegi
L Bo
P Christen
P Zezula
Q Wang
R Connor
RC Steorts
V Levenshtein
XL Dong
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process. Postprint.
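To make the idea of metric space indexing concrete, here is a minimal sketch of one standard technique from that family, pivot-based filtering with the triangle inequality. It is not necessarily the index structure the authors used. The choice of Levenshtein distance, the single pivot, the `PivotIndex` class, the query radius, and the sample names are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Unit-cost edit distance, which satisfies the metric axioms."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


class PivotIndex:
    """Hypothetical single-pivot index over string records."""

    def __init__(self, records, pivot):
        self.pivot = pivot
        # Precompute each record's distance to the pivot once.
        self.entries = [(rec, levenshtein(pivot, rec)) for rec in records]
        self.distance_calls = 0

    def range_query(self, query, radius):
        """Return every record within `radius` of `query`, with no false dismissals."""
        d_pq = levenshtein(self.pivot, query)
        matches = []
        for rec, d_px in self.entries:
            # Triangle inequality: d(query, rec) >= |d(p, query) - d(p, rec)|,
            # so records failing this bound cannot be within the radius.
            if abs(d_pq - d_px) > radius:
                continue
            self.distance_calls += 1
            if levenshtein(query, rec) <= radius:
                matches.append(rec)
        return matches


names = ["jonathan smith", "jonathon smith", "jon smith",
         "maria garcia", "mario garcia", "li"]
index = PivotIndex(names, pivot="jonathan smith")
result = index.range_query("jonathan smyth", radius=2)
print(result)
print(index.distance_calls, "of", len(names), "records compared in full")
```

The pruning step only discards records that provably lie outside the radius, so the result is identical to an exhaustive scan. That property is what "complete" linkage refers to in the abstract, in contrast to approximate schemes such as locality sensitive hashing.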

Crossref

University of St. Andrews - Pure

St Andrews Research Repository